12. Summary

### Policy-Based Methods

  • With value-based methods, the agent uses its experience with the environment to maintain an estimate of the optimal action-value function. The optimal policy is then obtained from the optimal action-value function estimate.
  • Policy-based methods directly learn the optimal policy, without having to maintain a separate value function estimate.

### Policy Function Approximation

  • In deep reinforcement learning, it is common to represent the policy with a neural network.
    • This network takes the environment state as input.
    • If the environment has discrete actions, the output layer has one node per action, and each node holds the probability that the agent should select the corresponding action (see the sketch after this list).
  • The weights in this neural network are initially set to random values. Then, the agent updates the weights as it interacts with (and learns more about) the environment.
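
As a minimal sketch (not a reference implementation), such a policy network could be written in PyTorch as follows; the state size, action size, and hidden layer width are placeholder values chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Policy(nn.Module):
    """Maps an environment state to a probability for each discrete action."""
    def __init__(self, state_size=4, action_size=2, hidden_size=16):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        # Softmax turns the output layer into a probability distribution over actions.
        return F.softmax(self.fc2(x), dim=-1)

policy = Policy()                               # weights start at (random) initial values
state = torch.rand(1, 4)                        # placeholder state
action_probs = policy(state)                    # e.g. tensor([[0.48, 0.52]])
action = torch.multinomial(action_probs, 1).item()  # sample an action from the distribution
```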

### More on the Policy

  • Policy-based methods can learn either stochastic or deterministic policies, and they can be used to solve environments with either finite or continuous action spaces.

### Hill Climbing

  • Hill climbing is an iterative algorithm that can be used to find the weights \theta for an optimal policy.
  • At each iteration,
    • We slightly perturb the values of the current best estimate for the weights \theta_{best}, to yield a new set of weights.
    • These new weights are then used to collect an episode. If the new weights \theta_{new} yield a higher return than the old weights, we set \theta_{best} \leftarrow \theta_{new} (a sketch of this loop appears below).
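
A minimal sketch of this loop, assuming a hypothetical helper `evaluate_return(weights)` that runs one episode with the given policy weights and returns the episode's return; the noise scale and iteration count are illustrative choices.

```python
import numpy as np

def hill_climbing(evaluate_return, weight_shape, noise_scale=0.1, n_iterations=1000):
    """Hill climbing over policy weights (sketch)."""
    theta_best = np.random.randn(*weight_shape)   # random initial weights
    best_return = evaluate_return(theta_best)

    for _ in range(n_iterations):
        # Slightly perturb the current best weights to get candidate weights.
        theta_new = theta_best + noise_scale * np.random.randn(*weight_shape)
        new_return = evaluate_return(theta_new)
        # Keep the candidate only if it achieved a higher return.
        if new_return > best_return:
            theta_best, best_return = theta_new, new_return

    return theta_best
```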

### Beyond Hill Climbing

  • Steepest ascent hill climbing is a variation of hill climbing that considers a small number of neighboring policies at each iteration and chooses the best among them.
  • Simulated annealing uses a pre-defined schedule to control how the policy space is explored, and gradually reduces the search radius as we get closer to the optimal solution.
  • Adaptive noise scaling decreases the search radius with each iteration when a new best policy is found, and otherwise increases the search radius (see the sketch below).
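
A sketch of the adaptive-noise-scaling variant, again assuming the hypothetical `evaluate_return(weights)` helper from above; the shrink/grow factor of 2 and the scale caps are arbitrary illustrative choices, not prescribed values.

```python
import numpy as np

def adaptive_hill_climbing(evaluate_return, weight_shape,
                           noise_scale=0.1, min_scale=1e-3, max_scale=2.0,
                           n_iterations=1000):
    """Hill climbing with adaptive noise scaling (sketch)."""
    theta_best = np.random.randn(*weight_shape)
    best_return = evaluate_return(theta_best)

    for _ in range(n_iterations):
        theta_new = theta_best + noise_scale * np.random.randn(*weight_shape)
        new_return = evaluate_return(theta_new)
        if new_return > best_return:
            # Found a better policy: keep it and shrink the search radius.
            theta_best, best_return = theta_new, new_return
            noise_scale = max(noise_scale / 2.0, min_scale)
        else:
            # No improvement: widen the search radius.
            noise_scale = min(noise_scale * 2.0, max_scale)

    return theta_best
```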

### More Black-Box Optimization

  • The cross-entropy method iteratively suggests a small number of neighboring policies, and uses a small percentage of the best-performing policies to calculate a new estimate (sketched below).
  • The evolution strategies technique considers the return corresponding to each candidate policy. The policy estimate at the next iteration is a weighted sum of all of the candidate policies, where policies that got higher return are given higher weight.
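
A sketch of the cross-entropy method under the same assumptions as before (`evaluate_return(weights)` is the hypothetical episode-return helper; population size, elite fraction, and noise scale are illustrative): sample a population of candidate weights around the current estimate, keep the top-performing fraction (the "elite"), and average the elite to form the next estimate.

```python
import numpy as np

def cross_entropy_method(evaluate_return, weight_shape, pop_size=50,
                         elite_frac=0.2, noise_scale=0.5, n_iterations=100):
    """Cross-entropy method over policy weights (sketch)."""
    n_elite = int(pop_size * elite_frac)
    theta_mean = np.random.randn(*weight_shape)

    for _ in range(n_iterations):
        # Sample a population of candidate weights around the current estimate.
        candidates = [theta_mean + noise_scale * np.random.randn(*weight_shape)
                      for _ in range(pop_size)]
        returns = np.array([evaluate_return(c) for c in candidates])
        # Keep the best-performing candidates (the "elite") ...
        elite_idxs = returns.argsort()[-n_elite:]
        elite = [candidates[i] for i in elite_idxs]
        # ... and average them to form the next estimate.
        theta_mean = np.mean(elite, axis=0)

    return theta_mean
```

For the evolution strategies variant described above, the elite-averaging step would instead be a weighted average over all candidates, with weights derived from each candidate's return.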

### Why Policy-Based Methods?

  • There are three reasons why we consider policy-based methods:
    1. Simplicity: Policy-based methods directly get to the problem at hand (estimating the optimal policy), without having to store a bunch of additional data (i.e., the action values) that may not be useful.
    2. Stochastic policies: Unlike value-based methods, policy-based methods can learn true stochastic policies.
    3. Continuous action spaces: Policy-based methods are well-suited for continuous action spaces.